Introduce 8da4w quant for decoder-only text models #62
Conversation
@kimishpatel @metascroy @jerryzh168 for review.
Tagging @tarun292 for review as we start adding quantization recipes for native HF models.
Fixed the executorch version check issue on Linux, where the installed version reports as '0.6.0+cpu', causing the version check to fail.
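For context, a naive comparison against the raw version string trips over the local '+cpu' segment; a tolerant check can compare only the base version. A minimal sketch, assuming the packaging library, and not the actual fix in this PR (the function name and minimum version are illustrative):

```python
from packaging.version import Version

def executorch_version_at_least(installed: str, minimum: str = "0.6.0") -> bool:
    # Version("0.6.0+cpu").base_version == "0.6.0", so a local segment
    # such as "+cpu" no longer affects the comparison.
    return Version(Version(installed).base_version) >= Version(minimum)

assert executorch_version_at_least("0.6.0+cpu")  # passes despite "+cpu"
```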
Quantization-related code LGTM.
Initial efforts to introduce quantization for native Hugging Face models that are already supported in optimum-executorch. Start with decoder-only text models, using "8da4w" (8-bit dynamic activations, 4-bit weights) for linear layers and int8 for embeddings (sketched after the list below). Experimented with the quantization configs on the following models:

- Qwen3-0.6B
- gemma-3-1b
- HuggingFaceTB/SmolLM2-135M
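For reference, here is a minimal sketch of what this scheme amounts to with torchao's `quantize_` API. This is an illustration, not the recipe code in this PR: the config classes, group size, and per-axis granularity below are assumptions.

```python
import torch
from torchao.quantization.granularity import PerAxis, PerGroup
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    IntxWeightOnlyConfig,
    quantize_,
)

def quantize_decoder(model: torch.nn.Module) -> torch.nn.Module:
    # int8 weight-only quantization for embedding tables, one scale per
    # output channel (PerAxis(0) is an assumed granularity).
    quantize_(
        model,
        IntxWeightOnlyConfig(weight_dtype=torch.int8, granularity=PerAxis(0)),
        lambda m, fqn: isinstance(m, torch.nn.Embedding),
    )
    # "8da4w": int8 dynamic activations with 4-bit weights for linear
    # layers; group size 32 is an assumed choice.
    quantize_(
        model,
        Int8DynamicActivationIntxWeightConfig(
            weight_dtype=torch.int4,
            weight_granularity=PerGroup(32),
        ),
    )
    return model
```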
Example usage

via optimum-cli:

```
optimum-cli export executorch \
  --model Qwen/Qwen3-0.6B \
  --task text-generation \
  --recipe xnnpack \
  --use_custom_sdpa \
  --qlinear \
  --qembedding \
  --output_dir qwen3_8da4w_8we
```
or use ExecuTorchModelForCausalLM.from_pretrained:

```python
et_model = ExecuTorchModelForCausalLM.from_pretrained("./qwen3_8da4w_8we")
```

.pte size comparison:
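For completeness, a hedged end-to-end sketch of loading the quantized export and running generation; the prompt and max_seq_len values are illustrative, and `text_generation` follows optimum-executorch's documented usage:

```python
from transformers import AutoTokenizer
from optimum.executorch import ExecuTorchModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
et_model = ExecuTorchModelForCausalLM.from_pretrained("./qwen3_8da4w_8we")

# Generate with the quantized .pte program.
output = et_model.text_generation(
    tokenizer=tokenizer,
    prompt="Simply put, the theory of relativity states that",
    max_seq_len=64,
)
print(output)
```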